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Abstract 

The cubic Klein-Gordon equation is a simple but non-trivial 
partial differential equation whose numerical solution has the 
main building blocks required for the solution of many other 
partial differential equations. In this study, the library 2DE- 
COMP&FFT is used in a Eourier spectral scheme to solve the 
Klein-Gordon equation and strong scaling of the code is ex¬ 
amined on thirteen different machines for a problem size of 
512^. The results are useful in assessing likely performance 
of other parallel fast Eourier transform based programs for 
solving partial differential equations. The problem is chosen 
to be large enough to solve on a workstation, yet also of in¬ 
terest to solve quickly on a supercomputer, in particular for 
parametric studies. Unlike other high performance comput¬ 
ing benchmarks, for this problem size, the time to solution 
will not be improved by simply building a bigger supercom¬ 
puter. 


1. INTRODUCTION 


The focusing nonlinear Klein-Gordon equation describes 
the evolution of a possibly complex scalar held u according 


to 


Uff — Au + u 


ImPm, 


( 1 ) 


where t is time and A = 3v 


+ ^yy + 3s 


the three-dimensional 


Laplacian. In this note we will focus on real valued func¬ 
tions M only. This equation can exhibit a phenomena known 
as blow up in which the held u becomes inhnite at a cer¬ 
tain point. Many partial differential equations which model 
physical phenomena can exhibit blow up, which typically in¬ 
dicates that the differential equation model is no longer a 
good model for the physical phenomenon under investiga¬ 
tion. Understanding under which initial conditions a blow 
up in the Klein-Gordon equation will occur is still not clear 
and it is hoped that parametric simulations can help to eluci¬ 
date this. Some parametric numerical studies of three dimen¬ 
sional radially symmetric real solutions to the Klein-Gordon 


equation have recently been done by Donninger and Schlag 


1 20111 . Their parametric study generated a large amount of 
data, which was unfeasible to store. Their study also made 
an assumption of spherical symmetry to be able to solve 
a one-dimensional equation for the radial component of a 
three-dimensional field. It is of interest to perform similar 
parametric simulations in three dimensions without symme¬ 
try assumptions for the Klein-Gordon and other partial dif¬ 
ferential equations. Eourier spectral methods are an effec¬ 
tive tool for doing this on petascale supercomputers. The 
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purpose of this paper is to conduct strong scaling experi¬ 
ments of a full model problem on different platforms, try¬ 
ing to understand how much faster a moderate size problem 
can be made to run. The Klein-Gordon equation is a simple 
model application for which different choices of numerical 
methods and computer architectures can be tried to deter¬ 
mine which one will give an accurate enough solution at ei¬ 
ther the fastest time, lowest computational or lowest energy 
cost. To limit the scope of this study, parallel fast Fourier 
transforms (FFT) are done with the library 2DECOMP&FFT 

2014| which primarily uses 


I Li and Laizet 2010||2DE 


MPI^LL_TO^LLV or MPT A T E TO ATE for its commu¬ 
nications. However, it would be interesting to perform a 
similar study with other parallel Eourier transform libraries 
(for example PEPT ||Pippig 


JDuyandOza^ 


2014||Ope 


2013IIPPP 2014 


2010[||PKU| |2014| and P3DFPT | |Pekurovsky 


201311 

20!4j, 

TT|P 


20141, PKUFFT |Chen et al. 


OpenFFT 


2012HP3D 


20141). The current focus of the study is on a discretiza¬ 


tion of 512^ since (i) there are still a wide variety of nu¬ 
merical experiments that can be done for problems of this 
size, (ii) post processing can be done locally on worksta¬ 
tions and (iii) for parametric studies, it is unclear whether 
access to high-performance computers or a distributed com¬ 
puting cloud based solution is most appropriate. It would also 
be interesting to try out other numerical methods, such as 
higher-order time stepping, a lattice-Boltzmann method 0 
et al.[ 20111 and an implicit hnite difference/hnite element 


spatial discretization approach with multigrid, fast multipole, 
tree code and/or preconditioned conjugate gradient solvers. 
For related work see |Gholami et al.|02O14| , from which it is 
clear that a careful choice of benchmarking solution is needed 
when comparing different discretization methods and elliptic 
system solvers. A later study will describe how such model 
could be used to rank computing systems as a possible addi- 


tion to the Linpack benchmark flMeuer et al. 

20141 and as an 

update of the NAS parallel benchmarks |Bai 

ey 

2011 [.There 


are already several attempts to generate benchmarks to sup¬ 
plement the Linpack ranking of supercomputers, such as high 
performance conjugate gradient | HPC[ 2014| and high per¬ 
formance multigrid [ [HPG| |^14| . However, neither of these 
solve a real world problem or allow for algorithmic improve¬ 
ments in methods for solving differential equations and lin¬ 
ear systems (see for example [ Ballard et al.l 20 14) a nd for 
Poisson’s equation [Demmel [1997 p. 277], | |Demmel[ 2014[ 
Lecture21]). This paper aims to start a discussion on a simple 
method to rank supercomputers for solving realistic problems 
that also allows for algorithmic improvements, emphasizes 
the entire computer eco-system, including software that can 
be used and developed for that system, and the people that 
operate the system. Ramachandran et al. j 2013| have already 
noted that better ways to evaluate accelerators are required. 
By stating a more general problem than solution of a linear 


system of equations, one can use the best architecture specihc 
algorithm on each platform thereby making a computer rank¬ 
ing not only a measurement of CPU double precision floating 
point power, but also of problem solving effectiveness. 

1.1. The Klein-Gordon Equation 

[Nakanishi and SchlagjPOI I) give an introduction to some 
of the theory of the Klein-Gordon equation, focusing primar¬ 
ily on the three-dimensional radial case. Two-dimensional 
simulations of the Klein-Gordon equation can be found in 
Bao and Yang 1 2007 1 and Yang |2006|. The linear Klein- 
Gordon equation occurs as a modification of the linear 
Schrodinger equation that is consistent with special relativ¬ 
ity, see for example Landau 1 1996) or Grennier 1 1984) . At the 
present time, there have been no numerical studies of blow up 
of solutions to this equation without the assumption of radial 
symmetry. This equation has generated a large mathematical 
literature and yet is still poorly understood. Most of this math¬ 
ematical literature has focused on analyzing the equation on 
an infinite three-dimensional space with initial data that either 
decays exponentially as one tends to infinity or is nonzero on 
a finite set of the domain. Here, we will simulate this equation 
in a periodic setting. Since this equation is a wave equation, 
it has a finite speed of propagation of information, much as 
a sound wave in air takes time to move from one point to 
another. Consequently for short time simulations, a simula¬ 
tion of a solution that is only nonzero on a finite part of the 
domain is similar to a simulation on an infinite domain. How¬ 
ever, over long times, the solution can spread out and interact 
with itself on a periodic domain, whereas on an infinite do¬ 
main, the interaction over long times is significantly reduced 
and the solution primarily spreads out. Understanding the in¬ 
teractions in a periodic setting is an interesting mathematical 
problem. Sufficiently smooth solutions of the Klein-Gordon 
equation conserve the energy given by 


f 1 

E{u,ut) = / -|m,| 




When accurate time stepping schemes are used for suffi¬ 
ciently bounded and differentiable solutions, the energy can 
act as a test of the correctness of the code implementation, the 
libraries used and of the computer hardware, since if the en¬ 
ergy is not conserved and the implementation is correct, there 
is likely to be a hardware or library error. 

1.1.1. Numerical Schemes 

The two time stepping schemes that were used by |Don- 
ninger and Schlag | 2011| for radially symmetric solutions, 
can be readily adapted to fully three-dimensional simulations 
using Fourier pseudospectral discretization instead of a finite 
difference spatial discretization. One of these schemes is sim¬ 
ple to implement, and a modification of this for implicit time 


























































































































stepping is described below since it is typical of numerical 
methods used for finding approximate solutions to partial dif¬ 
ferential equations and can be modified for use with other grid 
based spatial discretization methods. 


1.1.2. A Second-Order Scheme 

A modihcation of a second-order scheme used by Don- 


[ninger and Schlag]|2011) and implemented in this study is 

u"+' +2u" + u"-' 

-A-^- 


5f2 

, + 2 „„ 
+ ^ — |M I M , 


( 2 ) 


where u" Ki u{ndt,x,y,z). Time stepping takes place in Fourier 
space where the linear elliptic equation from the semi- 
implicit time discretization is easy to solve, and the three 
dimensional Fourier transform is used to obtain the nonlin- 


ear term in real space, 
method can be found in 

A more detailed explanation of the 

Cloutier et al. |2012|, Rigge|||2012| 

and Balakrishanan et al. 

|2014 . Example Matlab and Python 


implementations of this method, as well as the parallel code 


can be found in |Balakrishanan et al. 102014) . This scheme re¬ 
quires two FFTs per time step. The implementation used in 
this study allows for the held u to be complex. 


2. RESULTS 

We compare the scalability of the numerical scheme, with¬ 
out output to disl|^ In all cases the wall clock time for 30 
time steps was measured, taken. Figure shows strong scal¬ 
ing results and Table [T] lists the computers according to the 
shortest run time, as well as documenting the properties of 
each computer. 

2.1. Performance Model 

Before discussing the results, it is helpful to have a model 
for how fast parallel computation can make this computa¬ 
tion. I Williams et al.| 12009[ introduced the roohine model 
which provides a simple upper bound for performance of 
an algorithm at the node level. Fast Fourier transforms are 
typically bandwidth limited, the loop based computations 
in the code are performed on long vectors with unit stride 
accesses and little re-use of values loaded from memory, 
and so are similarly bandwidth limited at the node level. 
For each Fourier Transform, two MPIW^LL_TO_ALL_V or 
MPI_ALL_TO_ALL calls are used. For small message sizes, 
these are usually latency limited, while for large message 
sizes, these are usually network bandwidth limited. Since 

'Except for the VSC2 for which no noticeable difference in runtime was 
observed when doing output to disk, and when not doing output to disk. 


the problem size considered here is not too large, a sim¬ 
ple model would use on chip bandwidth and network la¬ 
tency to determine an estimate of performance and scalabil¬ 
ity. Let a single core have a bandwidth of B^, and the min¬ 
imum network latency for a single byte be L„. For a dis¬ 
cretization of size grid points, each time step requires 
approximately diN^ double precision floating point opera¬ 
tions and 2rf2[Alog(A)]2 operations for the FFT where di 
and d 2 are constants. Hence, on a single core, the runtime is 
approximately given by +2d^[A'iog(Af)] ^ ^ 

cesses the runtime is +2t^[^iog(A')] ^ assuming no network 
communication costs. When there is network latency, we get 
diN + 2 t^[OTog(A')] Thus, the network latency gives a 

lower bound on the possible speedup. It also takes time to 
send messages across a network. This depends on the topol¬ 
ogy of the network being used and the algorithm used to per¬ 
form the MPI_ALL_TO_ALL_V or MPI_ALL_TO_ALL ex¬ 


change Bertsekas et al. 119911, Grama et al. |2003 p. 537], 
ISwarztrauber | I987| , Swarztrauber and Hammondj | |200I| . 
Since the problem size is not too large, it is reasonable to as¬ 
sume some small network dependent constant dj, multiplied 
by the natural logarithm of the total number of processes (the 
lower bound for a hypercube network that is well suited to the 
FFT) so that we obtain +tt^[^iog(A')l log(p). The 

model is similar though not as detailed as the ones in [Ayala 


and Wang |20131, Kerbyson and Barker 1 201 1| and |Kerbyson 
et al. 1 2013|. Since different runtime FFT algorithms on each 


machine are used, and since the slab decomposition is used 
for a small number of process and the pencil decomposition 
is used for more than 512 processes, and since there are dif¬ 
ferent algorithms for performing an all to all exchange, the 
best of which will depend on the size of the problem being 
solved, the computer being used and the number and loca¬ 


tion of the processors being used on that computer (see Fos- 


|ter and Worley] | |1997[ ), a more detailed model is not devel¬ 
oped. For small p and fixed N, the runtime decreases close 
to linearly, and once p is large enough the runtime starts to 
increase again due to communication costs. Since most of the 
computers used in this study are not hypercubes, the model 
can only provide a lower bound of the time to solution with¬ 
out communication and computation overlap. Given the large 
number of computers used, and the fact that even on a single 
computer, process placement, software environment conhg- 
uration, as well as communications by other compute jobs 
would affect program performance, it is infeasible to give a 
more complete model for the communication cost. It should 
be noted, that such a model may also be applicable to accel¬ 
erated computers, if there is sufficient network support and 
a good implementation, see for example Chen et al. |2010 , 
[Czechowski et al. 10201 l| , [Czechowskiet ah 1 20121 , Park eta ]] 
]2013| , [Song and Hollingsworth| |2014| and | Swarztrauber] 




































































































Table 1. A table showing ranking of speed at which different computers solve the Klein-Gordon equation for a grid size of 512^. The systems used, their 
compute chips, interconnect and underlying one dimensional FFT library used by 2DECOMP&FFT, theoretical chip bandwidth from RAM and theoretical 
peak floating point performance are also shown. Beacon, Stampede and Titan have accelerators which were not used in the study, hence their properties are not 
listed nor used in the calculation of theoretical peak double precision floating point performance. 
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Figure 1. Scaling results showing computation time for 30 time steps as a function of the number of processor cores. 


1 1997| . The model does not account for overlap of compu¬ 
tation and communication in computing a distributed Fourier 
transform, which have been tried by Kandalla et al. I POTTI 
and Hoefler and Zerah [2007 1 (MPI 3.0 non blocking collec¬ 
tives make this easier to implement), for which one might be 
able to decrease the lower bound. The work here is similar 
in spirit to that in Worley and Foster 1 1994| , but the equation 
used is simpler, and hence easier to use for evaluating com¬ 
puter performance. 


2.2. Result Summary 

Table [T] shows that for this problem, a ranking can be 
obtained that would allow for easy evaluation of computer 
architectures for solving partial differential equations of a 
given discretization in the fastest time possible using algo¬ 


rithms that are dominated by Fourier transforms. Figure [T] 
shows that the strong scaling limit is not reached on all the 
platforms primarily due to clusters being too small or due 
to queue size restrictions. For cases where less cores were 
used than close to full system size and the strong scaling 
limit was not reached, further computation time is required, 
in particular on Titan and Hornet. It is likely that the re¬ 
sults in the table can change with extensive tuning and care¬ 
ful job placement, however they are representative of what 
occurs in a typical production environment. Figure shows 
scaling, but rather than using core count, uses total proces¬ 
sor bandwidth which is defined as bandwidth per node mul¬ 
tiplied by number of nodes used. Czechowski et al. |2012| 
and Marjanovic et al. | 2014| indicate that node level band¬ 
width may be more important than system interconnect band- 
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Figure 2. Scaling results showing computation time for 30 time steps as a function of total on chip bandwidth defined as the 
maximum theoretical bandwidth from RAM on a node multiplied by the number of nodes used. 


width, and indicate that for benchmarks which look at a 
hxed problem size, a few key performance indicators can re¬ 
place the benchmark - for the current code, memory band¬ 
width and performance of MPI^LL_TO_ALL are good indi¬ 
cators and for HPCG| [HPC| |20T4| , [Marjanovic eTak] 
indicate that memory bandwidth and MPI_ALL_REDUCE 
time are good performance indicators. Good performance for 
MPl_ALL_TO_ALL is much harder to achieve than good per¬ 
formance for MPI^LL_REDUCE and will in particular im¬ 
ply good performance for MPI_ALL_REDUCE, though likely 
at a higher hardware cost. 


2.3. Result Discussion 

Table[T]shows that Hornet produces the fastest run time due 
to its high performance communication network and fast pro¬ 
cessors. Juqueen does however have a much higher theoret¬ 
ical peak floating point performance than Hornet. Juqueen’s 
large number of processor cores are difficult to use efficiently 
for this problem size, thus despite Hornet’s smaller size, it is 
more effective at solving the Klein Gordon equation using the 
current algorithm. Similar behavior is observed on Marenos- 
trum 111, where network performance makes it difficult to uti¬ 
lize a large portion of the machine despite the high theoretical 
peak performance. On Aquila, there is a pronounced drop in 
performance when going from 8 to 16 cores, and after this the 
scaling is close to ideal again relative to the 16 core run (and 















the efficiency for 256 cores would be 76.5% if it was mea¬ 
sured relative to the 16 core results). This drop in performance 
is likely because the 8 core run only requires intra-node com¬ 
munications, whereas all runs with higher core counts have to 
send messages via the slower infiniband interconnect. A sim¬ 
ilar drop in performance is observed on Hornet in going from 
2048 cores to 3072 cores, since MPI communication requires 
two communication steps on the dragon fly topology rather 
than just one communication step. On Beacon, the runtime 
was quite sensitive to process placement, likely due to net¬ 
work topology. Neser is the oldest machine used in this study. 
It is a Linux cluster with a 1Gb ethernet network that is still 
fast at small core counts. Neser’s low cost gigabit ethernet 
network prevents good scaling behavior to larger core counts 
and explains its low ranking in table [T] Speed up on Shaheen 
was very close to ideal due to the fast interconnect and bal¬ 
anced design which allows for good throughput to the cores 
given their maximum floating point performance. Shaheen 
has 0.85 GHz cores and Juqueen has cores that are clocked 
at 1.6GHz and can do twice as many floating point operations 
per cycle, yet for the same number of cores, Juqueen is on 
average only 1.4 times faster than Shaheen, this is likely be¬ 
cause each core on Juqueen has a lower share of bandwidth 
(2.66Gb/s) than on Shaheen (3.375 Gb/s). The performance 
on Vedur is significantly worse than on Hector, despite hav¬ 
ing the same compute chips. This is likely due to interconnect 
latency. Finally, fig. also shows that newer Intel processors 
on Beacon, Hornet, Marenostrum and Stampede give much 
better performance than the Power PC, older Intel and AMD 
processors for the same node level bandwidth. The reason for 
the improved performance is likely to be due to the larger 
number of floating point operations that can be done per cy¬ 
cle compared to the other architectures, and the more sophis¬ 
ticated memory controllers - this indicates that simply know¬ 
ing chip bandwidth and performance of MPI^LL_TO^kLL 
allows for a simple but incomplete model. As explained by|LoJ 
|et al.| | |2OT4l , these characteristics of a machine are also dif¬ 
ficult to measure precisely so their use for predictions should 
be done with care, though they may give a good but not per¬ 
fect initial approximation. 

3. FUTURE WORK 

The results in this paper give a guide for codes which heav¬ 
ily utilize the Fourier transform on the different combinations 
of processor and interconnect that will give the best overall 
computation time. There are a variety of other methods for 
solving the same equation, and other aspects of using super¬ 
computers that have not been covered in the present study, but 
would be useful to cover in other studies. Possible future work 
includes: finite difference/finite element codes using multi¬ 
grid, tree code, fast multipole or conjugate gradient linear 
equation solvers for the implicit linear system solve in the 


time discretized Klein-Gordon equation, high order timestep¬ 
ping, in-situ visualization, measurement of computer energy 
consumption, use of accelerators, effectiveness of input and 
output, and more detailed performance models can be done 
in further work on a smaller set of computers. Care will be 
required in choosing initial conditions to allow for a mean¬ 
ingful comparison with other ways of discretizing this equa¬ 
tion and solving the linear equations in implicit time stepping 
schemes. 
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